ABSTRACT


Automatic speech recognition algorithms generally rely on the assumption that for the
distance measure used, intraword variabilities are smaller than interword variabilities so
that appropriate separation in the measurement space is possible. As evidenced by degradation
of recognition performance, the validity of such an assumption decreases from simple
tasks to complex tasks, from cooperative talkers to casual talkers, and from laboratory
talking environments to practical talking environments.

This report presents a study of talker-stress-induced intraword variability, and an algorithm
that compensates for the systematic changes observed. The study is based on Hidden
Markov Models trained by speech tokens spoken in various talking styles. The talking
styles include normal speech, fast speech, loud speech, soft speech, and talking with noise
injected through earphones; the styles are designed to simulate speech produced under real
stressful conditions.

Cepstral coefficients are used as the parameters in the Hidden Markov Models. The stress
compensation algorithm compensates for the variations in the cepstral coefficients in a
hypothesis-driven manner. The functional form of the compensation is shown to correspond
to the equalization of spectral tilts.

Preliminary  experiments  indicate  that  a  substantial  reduction  in  recognition  error  rate  can 
be  achieved  with  relatively  little  increase  in  computation  and  storage  requirements. 
CEPSTRAL DOMAIN TALKER STRESS COMPENSATION
FOR  ROBUST  SPEECH  RECOGNITION 


1.  INTRODUCTION 

Current speech recognition systems generally degrade significantly in performance if the
systems are not both trained and tested under similar talking conditions. A major reason for
performance degradation when testing and training conditions differ is that people speak differently
under different conditions. Previous research has demonstrated that differences in speech patterns
can be caused by psychological or emotional stress,1-5 by the presence of intense background
noise,5-7 by a demanding perceptual-motor or mental task,2^ by physical exertion,2 and by natural
inconsistencies in pronunciation.8,9 Despite the knowledge that speech patterns change in
stress and in noise and the demonstration of degraded recognition performance in stress, little
speech recognition research has been directed at modeling the systematic changes observed and at
developing recognition systems that are resistant to such changes.

This report presents a study of talker-stress-induced variations in speech cepstral coefficients,
and an algorithm that compensates for systematic (but unknown a priori) changes observed. The
study is based on an isolated-word Hidden Markov Model speech recognizer trained by speech
spoken in various talking conditions. The organization of this report is as follows. In Section 2 a
speech data base that has been used in the study is described. In Section 3 a baseline Hidden
Markov Model (HMM) speech recognizer is defined. In Section 4 a multistyle training experiment
is described. The success of the multistyle training experiment has prompted the study of
the stress-induced changes in speech, which is discussed in Section 5, and the development of a
stress compensation algorithm, which is discussed in Section 6. The compensation algorithm may
be interpreted as an adaptive, word-hypothesis-driven form of spectral tilt equalization. Section 7
presents experimental results.

Unless otherwise noted, the notations used in this report are as follows: boldfaced upper
case letters such as “A” represent matrices, boldfaced lower case letters such as “b” represent
column vectors, and lower case letters such as “c” represent scalars. Elements of matrices and
vectors may be written as scalars with an appropriate number of subscripts. The lower case letters
i and j are used as indices; the upper case letters N and M are used to indicate dimensionalities.
Therefore, the matrix A may be written as $[a_{ij}]_{MN}$, and the vector b may be written as $[b_i]_N$.
Lower case letters followed by one or more arguments enclosed in parentheses, such as “f(x),”
represent functions.

2. THE “SIMULATED STRESS” SPEECH DATA BASE

The  studies  and  experiments  conducted  in  this  research  were  based  on  the  “simulated 
stress”10  speech  data  base  recently  collected  by  Texas  Instruments,  Inc. 

In this data base stress-like degradations of the speech signal were elicited by asking the
speaker to produce speech in a variety of styles (normal, fast, loud, soft, and shout) as well as
with 95-dB pink noise exposure in the ear to produce the Lombard effect.5 The vocabulary
consisted of 105 words, including monosyllabic, polysyllabic, and confusing words. A complete list of
the words is given in the Appendix.

The  data  base  was  divided  into  training  data  and  test  data.  Training  data  consisted  of  five 
samples  of  each  of  the  105  words  collected  in  a  random  order  under  normal  talking  conditions, 
and  test  data  consisted  of  two  samples  of  each  word  under  each  simulated-stress  condition.  Data 
were  collected  from  five  adult  males  and  three  adult  females  using  a  16-bit  A/D  converter, 
sampled  at  20-kHz  rate,  in  a  quiet  laboratory  environment.  The  data  were  downsampled  to 
8  kHz  for  laboratory  usage.  The  total  number  of  test  word  tokens  was  10,080. 

To verify the effects of “simulated stress,” a baseline Hidden Markov Model based recognizer
(to be described in the next section) has been tested on this data base. The substitution rate
of this test (no rejection or deletion was allowed in the experiments) is given in Table I. It is seen
that, relative to normally spoken speech, the error rate increases significantly for the various style
conditions. It is the purpose of this research to understand the causes of the performance
degradation experienced, and to develop effective means to compensate for them.


TABLE I

Substitution Rate (Percent): The "Simulated Stress" Data Base

Condition       Norm   Fast   Loud   Noise   Soft   Shout   Avg5*   Avg6†
Baseline HMM     1.0    6.1   29.1    19.6   13.5    86.4    13.9    25.9

* Avg5 is the average error rate of all talking conditions except shout.
† Avg6 is the average error rate of all talking conditions.


3.  A  HIDDEN  MARKOV  MODEL  SPEECH  RECOGNIZER 

The theory of Hidden Markov Models and the application of HMM to automatic speech
recognition can be found in a number of papers.11-16

Figure  1  shows  the  type  of  HMM  we  used  in  this  research.  The  word  model  network  is  a 
linear  sequence  of  nodes  with  no  skip  branches.  The  model  is  intended  to  be  used  on  speech 
inputs  consisting  of  one  word  with  background  (silence)  at  each  end;  the  first  and  the  last  nodes 
are  background  nodes  to  provide  a  semi-open-endpoint  recognizer. 
Figure 1. The 10-node left-to-right Markov model.


The HMM can be described by a matrix-vector pair {A, b}. $A = [a_{ij}]$ is a bidiagonal
state transition matrix whose elements are given by

$$ a_{ij} = \begin{cases} p_i , & j = i + 1 \\ 1 - p_i , & j = i \\ 0 , & \text{otherwise} \end{cases} \tag{1} $$

Note that $p_i$ is the transition probability from node i to node i + 1, and $1 - p_i$ is the self-looping
probability of node i. At each state a vector v of continuous variables is observed. The probabilistic
nature of the vector v is described by a set of joint probability density functions $b = [f_i(v)]$.
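As an illustration of Equation (1), the following minimal Python sketch builds the bidiagonal matrix from a list of forward probabilities. The function name `transition_matrix` is ours, and the handling of the last node (its forward probability leaves the model, so the last row sums to 1 - p) is an assumption of the sketch rather than something stated in the report.

```python
def transition_matrix(p):
    """Build the bidiagonal matrix A of Equation (1).

    p[i] is the probability of moving from node i to node i + 1;
    1 - p[i] is the self-looping probability of node i.
    """
    n = len(p)
    a = [[0.0] * n for _ in range(n)]
    for i, p_i in enumerate(p):
        a[i][i] = 1.0 - p_i            # j = i: stay in the same node
        if i + 1 < n:
            a[i][i + 1] = p_i          # j = i + 1: advance one node
        # Assumption: for the last node the forward transition exits the
        # model, so that row sums to 1 - p[-1] instead of 1.
    return a


A = transition_matrix([0.5, 0.25, 0.1])
```

Every entry off the main diagonal and the first superdiagonal is zero, which is what rules out skip branches in the word model.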

The observation vector v contains 12 mel-frequency cepstral coefficients, i.e., $v = [v_j]_{12}$. The
cepstral coefficients are similar to those used by Davis and Mermelstein.17 To compute a cepstral
vector, 160 speech samples were read from the input, padded with 96 zeros, windowed by a
256-sample Hamming function, and transformed into the frequency domain via a 256-point FFT. In the
frequency domain, the magnitude of the spectrum is squared and multiplied by the function


$$ g(f) = 1 + \frac{f^2}{250000} \tag{2} $$


where f is frequency in hertz, to boost high frequency content. Logarithms of the frequency samples
are then taken. A set of 24 triangular-shaped windows (see Figure 2) is then used to compute
averaged log spectral parameters $x = [x_k]_{24}$. Notice that the bandwidths of the windows
increase as their center frequencies increase; the areas under the windows are kept constant.

From these averaged log spectral parameters x the cepstral coefficients $v_j$ are computed as

$$ v_j = \sum_{k=1}^{24} x_k \cos\left[ j \left( k - \frac{1}{2} \right) \frac{\pi}{24} \right] , \qquad j = 1, 2, \ldots, 12 \tag{3} $$
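The cosine transform of Equation (3) can be sketched in a few lines of Python. The function name `cepstrum_from_log_filterbank` is hypothetical, and the input is assumed to be the 24 averaged log spectral parameters already computed by the filter bank; the FFT, pre-emphasis, and triangular windows are not reproduced here.

```python
import math


def cepstrum_from_log_filterbank(x, n_ceps=12):
    """Apply Equation (3) to averaged log spectral parameters x.

    x[k - 1] holds x_k for k = 1..len(x) (24 bands in the report).
    Returns [v_1, ..., v_n_ceps]; the zeroth coefficient is not computed.
    """
    n = len(x)
    return [
        sum(x[k - 1] * math.cos(j * (k - 0.5) * math.pi / n)
            for k in range(1, n + 1))
        for j in range(1, n_ceps + 1)
    ]


# A flat log spectrum (a pure gain change) yields coefficients of zero,
# up to rounding error, because j = 0 is excluded.
flat = cepstrum_from_log_filterbank([3.0] * 24)
```

Because the constant term is left out, an overall level change in the input does not move the observation vector; only changes in spectral shape, such as tilt, do.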

Each node of the Hidden Markov Model is represented by a cepstral vector template which
in turn is characterized by a jointly normal distribution with mean vector c and covariance
matrix R. In our model, we assume that all off-diagonal elements of R are zero. [This is justified
in part by the fact that the outputs of the cosine transform in Equation (3) are approximately mutually
uncorrelated.18,19 Further justification is provided by the good recognition results obtained
by us and by others using this assumption.] With this assumption, the covariance matrix can be
reduced to a vector of variances, which we relabel as r.

Figure 2. Filter bank used in Equation (3). (Courtesy of D. B. Paul.)

In summary, the node transition characteristic of a Hidden Markov word model is described
by the transition probability matrix A; the observation parameter statistics of each node are
described by the cepstral mean vector c and the cepstral variance vector r.

Training  of  the  models  uses  the  forward-backward  algorithm,20  while  recognition  uses  the 
Viterbi  decoding  algorithm.21  Cepstral  coefficients  are  computed  once  every  10  ms. 
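To make the recognizer concrete, here is a minimal Python sketch of Viterbi scoring for a linear, no-skip word model with diagonal-covariance Gaussian nodes, followed by a forced decision over a vocabulary. It is a simplified illustration, not the report's implementation: the names `log_gauss_diag`, `viterbi_score`, and `recognize` are ours, background nodes are not treated specially, the path is assumed to start in the first node and end in the last, and the exit probability of the last node is ignored.

```python
import math

NEG_INF = float("-inf")


def log_gauss_diag(v, mean, var):
    """Log density of a diagonal-covariance Gaussian at observation v."""
    return sum(
        -0.5 * math.log(2.0 * math.pi * r) - (x - c) ** 2 / (2.0 * r)
        for x, c, r in zip(v, mean, var)
    )


def viterbi_score(frames, p, means, variances):
    """Best-path log likelihood of frames through one word model."""
    n = len(p)
    score = [NEG_INF] * n
    score[0] = log_gauss_diag(frames[0], means[0], variances[0])
    for v in frames[1:]:
        new = [NEG_INF] * n
        for i in range(n):
            stay = score[i] + math.log(1.0 - p[i])          # self-loop
            move = score[i - 1] + math.log(p[i - 1]) if i > 0 else NEG_INF
            new[i] = max(stay, move) + log_gauss_diag(v, means[i], variances[i])
        score = new
    return score[-1]    # -inf if there are fewer frames than nodes


def recognize(frames, vocabulary):
    """Forced decision: vocabulary maps word -> (p, means, variances)."""
    return max(vocabulary, key=lambda w: viterbi_score(frames, *vocabulary[w]))
```

Since `recognize` always returns some word, every mistake it makes is a substitution, which matches the error measure used in the tables.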

Since  the  recognizer  makes  a  forced  decision  on  each  input  word,  substitution  is  the  only 
type  of  error  considered  here. 

4.  AN  EXPERIMENT  ON  MULTISTYLE-TRAINED  HIDDEN  MARKOV 
WORD  MODELS 

A number of different training/testing procedures could be used to improve speech recognition
performance under stress. Ideally, a recognition system could be trained and tested under the
same stress condition. This, however, is often not possible. A second alternative is to use
dynamic adaptation of the models based on recent recognition results. This, again, may not be a
satisfactory solution because stress conditions are transient. However, it is possible to use
multistyle training.22 Multistyle training requires a talker to train a recognizer using words spoken
with different talking styles instead of using words all spoken normally. It has been found to be
easy for a talker to change to styles such as fast, slow, loud, and soft, producing changes in
speech characteristics that are similar to changes that occur under stress. It remains to be
demonstrated that multistyle training produces improved recognition under stress.

An experiment on multistyle-trained Hidden Markov Model word recognition was performed.
In this experiment, 11 speech tokens were used to train each word model: 5 tokens from
the training data base, and 6 tokens, one per talking style except normal, from the test data base.
Recognition tests were conducted on the remaining half of the test data base. The recognition
error rates are listed in Table II. For comparison, the error rate of the baseline HMM system is
also included.

In  comparing  experimental  results  listed  in  Table  II,  we  see  a  dramatic  improvement  in 
recognition  performance.  It  appears  that  the  HMM  word  models  were  able  to  assimilate  the  data 
from  the  multiple  styles  and  to  capture  statistically  the  more  invariant  features  of  each  word.  In 
the  next  section  we  investigate  the  gross  changes  of  model  parameters  resulting  from  multistyle 
training  as  well  as  from  style  training  (as  opposed  to  normal  training). 




TABLE II

Substitution Rate (Percent): A Comparison of Normal-
and Multistyle-Trained HMM Recognizers

Condition        Norm   Fast   Loud   Noise   Soft   Shout   Avg5   Avg6
Baseline HMM*     1.0    6.1   29.1    19.6   13.5    86.4   13.9   25.9
Multistyle†       0.5    5.6    5.1     2.1    5.8    43.6    3.8   10.5

* The baseline system was trained with 5 normally spoken word tokens per talker
and tested on 10,080 test tokens.

† The multistyle-trained system was trained on 11 style speech tokens per talker
and tested on 5,040 test tokens.


5.  CEPSTRAL  DOMAIN  STRESS  COMPENSATION  -  DRIVEN 
BY  OBSERVATIONS 

The success of the multistyle training experiment motivated a comparison of the model
parameters trained under various talking styles to determine whether it would be possible to
compensate for the cepstral changes through simple transformations on the cepstral means and
variances obtained using normal training. Such a transformation, if effective, would eliminate the
need to ask the user to train the system with multiple styles and to incorporate multiple-style
data in the Forward-Backward training.

The  differences  among  normally  trained,  single-style-trained,  and  multistyle-trained  word 
models  are  partially  reflected  in  the  average  shifts  of  the  mean  values  and  in  the  average  scaling 
of  the  variances  of  the  cepstral  coefficients.  To  study  such  differences,  seven  different  sets  of 
word  models  were  examined.  Six  of  the  models  were  trained  under  six  individual  conditions 
(normal  training,  fast,  loud,  Lombard,  soft,  and  shout,  respectively),  while  the  seventh  was 
trained  using  a  composite  of  all  these  conditions  (multistyle).  The  cepstral  means  and  variances, 
averaged  over  all  words  in  the  TI  vocabulary,  over  all  speech  nodes  in  each  word,  and  over  all 
talkers,  were  computed  for  each  of  the  models  above. 

The mean cepstral shifts (i.e., cepstral means of the given model minus the cepstral means of
the normal model) for each of the cepstral coefficients are tabulated in Table III(a). Figure 3(a)
plots mean cepstral shifts for four cases: soft; shout; average of fast, loud, and Lombard; and
multistyle. Figure 3(b) plots the corresponding spectra of these mean shifts, contrasting the effects
on spectral tilt of low vocal effort (soft) vs higher vocal effort (fast, loud, Lombard, and shout).
Increased vocal effort increases the relative high frequency content, whereas the opposite occurs
with low vocal effort.
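The tilt interpretation can be checked numerically. The Python sketch below maps a vector of mean cepstral shifts back to the 24 mel-band grid with the same cosine basis as Equation (3). The soft and shout values are transcribed from Table III(a); the function name `shift_spectrum` is ours, and the overall scale of the result is arbitrary (an assumption of the sketch), so only the shape of the curve is meaningful.

```python
import math

# Mean cepstral shifts (style minus normal), coefficients 1..12, Table III(a).
SOFT = [0.90, 0.43, 0.37, 0.49, 0.27, 0.27, 0.12, 0.05, -0.01, -0.02, -0.03, 0.04]
SHOUT = [-3.08, -1.94, -1.28, -1.09, -0.53, -0.70, -0.19, -0.40, -0.32, -0.53,
         -0.20, -0.09]


def shift_spectrum(delta_c, n_bands=24):
    """Cosine-series spectrum of a cepstral shift on the mel-band grid.

    Band k = 1 is the lowest frequency band, k = n_bands the highest.
    """
    return [
        sum(d * math.cos(j * (k - 0.5) * math.pi / n_bands)
            for j, d in enumerate(delta_c, start=1))
        for k in range(1, n_bands + 1)
    ]


soft_spec = shift_spectrum(SOFT)
shout_spec = shift_spectrum(SHOUT)
# Shout: low bands drop relative to high bands (more high frequency content).
# Soft: the opposite tilt.
```

For the shout shifts the lowest band comes out well below the highest band, and for the soft shifts the ordering is reversed, consistent with the two opposite spectral tilts described above.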
TABLE III

Mean and Variance Variations in Style-Speech

(a) Mean Shifts in Cepstral Domain

Coeff    Fast    Loud    Lomb    Soft    Shout   Multi   AVG*
  1     -0.61   -1.14   -0.84    0.90   -3.08   -0.61   -1.07
  2     -0.26   -0.50   -0.59    0.43   -1.94   -0.43   -0.59
  3     -0.03   -0.48   -0.34    0.37   -1.28   -0.26   -0.37
  4     -0.06   -0.50   -0.36    0.49   -1.09   -0.25   -0.39
  5     -0.02   -0.18   -0.08    0.27   -0.53   -0.10   -0.13
  6     -0.02   -0.38   -0.29    0.27   -0.70   -0.17   -0.29
  7     -0.02   -0.11   -0.05    0.12   -0.19   -0.03   -0.07
  8      0.04   -0.18   -0.06    0.05   -0.40   -0.07   -0.09
  9     -0.05   -0.16   -0.08   -0.01   -0.32   -0.06   -0.12
 10     -0.06   -0.23   -0.10   -0.02   -0.53   -0.13   -0.17
 11     -0.04   -0.13   -0.11   -0.03   -0.20   -0.10   -0.13
 12     -0.06   -0.15   -0.15    0.04   -0.09   -0.05   -0.14

(b) Ratio of Variance

Coeff   Multi      Coeff   Multi
  1     2.07         7     1.55
  2     1.84         8     1.70
  3     1.75         9     1.84
  4     1.71        10     1.77
  5     1.47        11     1.83
  6     1.62        12     2.10

* Because the differences in the fast, loud, and Lombard conditions
are reasonably similar, their averages are listed in the last column.
Figure 3. Variations of cepstral coefficients compared to normally spoken words: (a) difference of mean (style
minus normal), (b) spectra of differences of mean, and (c) ratio of variance (multistyle/normal).
It is well known that spectral tilt exhibits large variation when a talker speaks under stress.
Such  variation  usually  contaminates  the  distance  measure  and  is  one  of  the  most  significant 
causes  of  recognition  performance  degradation.  It  appears  that  the  effect  of  spectral  tilt  could  be 
compensated,  to  some  extent,  by  applying  the  appropriate  cepstral  compensation  to  normally 
trained  word  models. 

Because variance estimation is less reliable than mean estimation, we have only compared
cepstral variances of multistyle-trained models, which used 11 training tokens, with the normally
trained models. Their ratios (multistyle/normal) are listed in Table III(b) and plotted in
Figure 3(c). It appears that the major style-induced variations occur in the most slowly varying
spectral components (corresponding to lower order cepstral coefficients), and in the most rapidly
varying spectral components (corresponding to the higher order coefficients).

The following cepstral compensation experiments were performed, in which new word models
were generated by modifying normally trained Hidden Markov word models by one or more
sets of cepstral differences. The word models were talker-dependent, but the modifications were
the same for all words and all talkers.

(a) Single Model Compensation: The set of cepstral mean differences and variance
ratios observed in multistyle-trained models [represented by filled
squares in Figure 3(a) and (c)] was applied as compensation in recognition
tests on all styles.

(b) Multimodel Compensation: Three sets of cepstral mean compensations
corresponding to the soft, the loud, and the shout-trained models, as well as
cepstral variance ratios for multistyle-trained models, were applied to generate
three new word models; together with the original normally trained word
model, they were used in recognition tests on all styles. In recognition, the
four models were used for each vocabulary word, and were treated independently
and equally; in effect, the computation for HMM recognition was
quadrupled.
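The two fixed compensation schemes above can be sketched as follows. This is a minimal Python illustration under stated assumptions: the names `compensate_model` and `multimodel_score` are ours, a model is represented as a pair of per-node mean and variance lists, `score_fn` stands for whatever likelihood scorer the recognizer uses, and treating the alternative models "independently and equally" is read here as taking the best score among them.

```python
def compensate_model(means, variances, mean_shift, var_ratio):
    """Derive a style model from a normally trained one.

    The same shift is added to the mean of every node and the same
    ratio multiplies the variance of every node, coefficient by
    coefficient.
    """
    new_means = [[c + d for c, d in zip(node, mean_shift)] for node in means]
    new_vars = [[r * s for r, s in zip(node, var_ratio)] for node in variances]
    return new_means, new_vars


def multimodel_score(frames, models, score_fn):
    """Score one vocabulary word as the best of its alternative models.

    models is a list of (means, variances) pairs, e.g. the normal model
    plus soft-, loud-, and shout-compensated versions (four in total).
    """
    return max(score_fn(frames, m, v) for m, v in models)
```

With one normal and three compensated models per word, `score_fn` is called four times per word, which is the fourfold increase in recognition computation noted in item (b).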

The recognition error rates of these experiments are listed in Table IV along with the error
rates of the baseline system and the multistyle-trained system for comparison. The error rate
reductions relative to the baseline system seem quite promising given the simplicity of the
compensation technique.

It  is  not  clear  how  many  different  styles  it  would  be  useful  to  add  in  similar  experiments 
before  recognition  performance  would  start  to  decline.  However,  evidence  recently  gathered^ 
indicates that a small number of well-selected styles might be sufficient. The next section
discusses a variation of the above technique: hypothesis-driven stress compensation.
TABLE IV

Substitution Rate (Percent):
A Comparison of Fixed Stress Compensation

Condition       Norm   Fast   Loud   Noise   Soft   Shout   Avg5   Avg6
Baseline HMM     1.0    6.1   29.1    19.6   13.5    86.4   13.9   25.9
Multistyle       0.5    5.6    5.1     2.1    5.8    43.6    3.8   10.5
Single Model     1.2    4.6   15.2    12.2   15.4    79.5    9.7   21.4
Multimodel       1.0    4.2   12.1     6.7    5.5    68.7    5.9   16.4


6. CEPSTRAL DOMAIN STRESS COMPENSATION - A HYPOTHESIS-DRIVEN
APPROACH

It is the high cost of increased computation and the uncertainty about training-style
sufficiency and efficiency that prompted us to search for alternatives. As a result of this effort, the
hypothesis-driven cepstral mean compensation technique, which adapts to the input speech and to
the hypothesized reference word, was developed. Fixed multistyle variance compensation has been
found beneficial for all styles and will be used in conjunction with the adaptive mean
compensation, unless stated otherwise.

As depicted in Figure 4, a talker is modeled as an information source that puts out a
sequence of deterministic cepstral vectors $\{c_t\}$.* Before the vectors are received by the decoder,
we assume that they undergo two stages of contamination.

[Figure 4: block diagram from SOURCE to DESTINATION.]

Figure 4. Model of the contamination of cepstral coefficients, where $\delta$ is a random noise factor
and $\chi$ is a deterministic, but unknown, stress factor.

* The subscript t is an index of time.
Stage 1

A sequence of independent identically distributed (i.i.d.) random vectors $\{\delta_t\}$ is added to the
cepstral sequence $\{c_t\}$ to create a new sequence $\{u_t\}$:

$$ u_t = c_t + \delta_t . \tag{4} $$

The sequence $\{\delta_t\}$ models the randomness of speech cepstral parameter outputs; its elements are
assumed to be normally distributed with zero mean vector and diagonal covariance matrix (see
discussion in Section 3).

Stage 2

A deterministic but unknown vector $\chi$ is added to the sequence $\{u_t\}$ to create the observation
sequence $\{v_t\}$, i.e.,

$$ v_t = u_t + \chi . \tag{5} $$

The vector $\chi$ is the additive “stress” component. Its ith element is assumed to have the functional form
[see Figure 3(a)]:

$$ \chi_i = a \, e^{-b(i-1)} , \tag{6} $$

and is further assumed to remain unchanged within a word interval.

Given a sequence of observations $v_t$, t = 1, 2, ..., T, we have developed a procedure for estimation,
based on maximum likelihood principles, of the parameters a and b in Equation (6). The
remaining part of this section deals with the derivation of this estimation procedure; readers who
are only interested in the experimental results may skip to the next section without loss of
continuity.

Our parameter estimation procedure is divided into two steps, the estimation of $\chi_i$ and the
smoothing:


Step 1 (Estimating $\chi_i$)

The probability density function of the observation $v_i$ is given by

$$ f(v_i) = \frac{1}{\sqrt{2\pi}\,\sigma_i} \exp\left[ -\frac{(v_i - c_i - \chi_i)^2}{2\sigma_i^2} \right] . \tag{7} $$


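The likelihood function of the ith observation variable, $v_{it}$, for t = 1, ..., T, conditioned on both
the sequence of the ith cepstral coefficient $c_{it}$, t = 1, ..., T, and the ith “stress” component $\chi_i$ is given
by

$$ l(v_{i1}, \ldots, v_{iT} \mid c_{i1}, \ldots, c_{iT}, \chi_i) = \prod_{t=1}^{T} \frac{1}{\sqrt{2\pi}\,\sigma_i} \exp\left[ -\frac{(v_{it} - c_{it} - \chi_i)^2}{2\sigma_i^2} \right] . \tag{8} $$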
Taking logarithms of both sides, we obtain the log likelihood function

$$ L(v_{i1}, \ldots, v_{iT} \mid c_{i1}, \ldots, c_{iT}, \chi_i) = -T \log\left( \sqrt{2\pi}\,\sigma_i \right) - \sum_{t=1}^{T} \frac{(v_{it} - c_{it} - \chi_i)^2}{2\sigma_i^2} . \tag{9} $$
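The maximization step is short enough to write out. Differentiating the log likelihood of Equation (9) with respect to the stress component and setting the result to zero gives the stationarity condition below; the notation ($v_{it}$, $c_{it}$, $\chi_i$, $\sigma_i$) follows the surrounding equations, and the second line is our added check that the stationary point is a maximum.

```latex
\frac{\partial L}{\partial \chi_i}
  = \sum_{t=1}^{T} \frac{v_{it} - c_{it} - \chi_i}{\sigma_i^{2}} = 0
\quad\Longrightarrow\quad
\sum_{t=1}^{T} \left( v_{it} - c_{it} \right) = T \, \chi_i ,
\qquad
\frac{\partial^{2} L}{\partial \chi_i^{2}} = -\frac{T}{\sigma_i^{2}} < 0 .
```

Note that the variance $\sigma_i^2$ cancels out of the stationarity condition, so the estimate of the stress component does not depend on it.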


The $\chi_i$ that maximizes Equation (9) is the classical maximum likelihood estimate of $\chi_i$; it is given
by

$$ \chi_i = \frac{1}{T} \sum_{t=1}^{T} (v_{it} - c_{it}) = \frac{1}{T} \sum_{t=1}^{T} v_{it} - \frac{1}{T} \sum_{t=1}^{T} c_{it} . \tag{10} $$

We replace the sample average of $c_{it}$, which is not observable, by the expected average value,
derived from the word hypothesis:


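$$ \bar{c}_i = E\left[ \frac{\sum_j \tau_j c_{ij}}{\sum_j \tau_j} \right] = \sum_j E\left[ \frac{\tau_j}{\sum_k \tau_k} \right] c_{ij} , \tag{11} $$

where the $\tau_j$'s are a set of mutually independent discrete random variables whose values represent
the dwell time in each of the nodes, $c_{ij}$ is the mean of the ith cepstral coefficient at node j, and
the summations are over all speech nodes.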


In Equation (11), the dwell time $\tau_i$ of node i is known to have the geometric probability mass function

$$ P_{\tau_i}(k) = p_i (1 - p_i)^{k-1} , \qquad k = 1, 2, \ldots , \tag{12} $$

where $1 - p_i$ is the self-looping probability of node i. If the Hidden Markov Model has N speech
nodes and we let the random variable $\gamma = \sum_{i=1}^{N} \tau_i$, it can be shown that if
$p_1 \ne p_2 \ne \cdots \ne p_N$ the probability mass function of $\gamma$ is the mixture distribution

$$ P_{\gamma}(k) = \sum_{i=1}^{N} w_i P_{\tau_i}(k) , \qquad k = N, N + 1, \ldots , \tag{13} $$
where

$$ w_i = \prod_{j=1,\, j \ne i}^{N} \frac{p_j (1 - p_i)}{p_j - p_i} . \tag{14} $$


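Since a closed form formula for $E\left[ \tau_i / \sum_j \tau_j \right]$ has not been found, we use an
approximation using up to the second-order moments. Let

$$ \gamma_i = \sum_{j=1,\, j \ne i}^{N} \tau_j , \qquad g(\tau_i, \gamma_i) = \frac{\tau_i}{\tau_i + \gamma_i} ; \tag{15} $$

then $\tau_i$ and $\gamma_i$ are independent and the expectation can be approximated by

$$ E[g(\tau_i, \gamma_i)] \approx g(\bar{\tau}_i, \bar{\gamma}_i) + \frac{1}{2} \left[ \sigma_{\tau_i}^2 \frac{\partial^2 g}{\partial \tau_i^2} + \sigma_{\gamma_i}^2 \frac{\partial^2 g}{\partial \gamma_i^2} \right]_{(\bar{\tau}_i, \bar{\gamma}_i)} = \frac{\bar{\tau}_i}{\bar{\tau}_i + \bar{\gamma}_i} + \frac{\bar{\tau}_i \sigma_{\gamma_i}^2 - \bar{\gamma}_i \sigma_{\tau_i}^2}{(\bar{\tau}_i + \bar{\gamma}_i)^3} \tag{16} $$

with the means and variances given by

$$ \bar{\tau}_i = \frac{1}{p_i} , \qquad \bar{\gamma}_i = \sum_{j=1,\, j \ne i}^{N} \frac{1}{p_j} , \tag{17} $$

$$ \sigma_{\tau_i}^2 = \frac{1 - p_i}{p_i^2} , \qquad \sigma_{\gamma_i}^2 = \sum_{j=1,\, j \ne i}^{N} \frac{1 - p_j}{p_j^2} . \tag{18} $$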
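The second-order approximation above is cheap to evaluate from the forward probabilities alone. The Python sketch below computes, for every speech node, the approximate expected fraction of a word's duration spent in that node; the function name `dwell_fraction_weights` is ours, and the input is assumed to be the list of forward probabilities $p_i$ of the speech nodes only.

```python
def dwell_fraction_weights(p):
    """Second-order approximation of E[tau_i / sum_j tau_j] per node.

    p[i] is the forward (leave) probability of speech node i, so the
    dwell time of node i is geometric with mean 1/p[i] and variance
    (1 - p[i]) / p[i]**2.
    """
    mean = [1.0 / q for q in p]
    var = [(1.0 - q) / q ** 2 for q in p]
    total_mean, total_var = sum(mean), sum(var)
    weights = []
    for m, s in zip(mean, var):
        g_mean = total_mean - m      # mean of gamma_i (all other nodes)
        g_var = total_var - s        # variance of gamma_i
        weights.append(m / total_mean
                       + (m * g_var - g_mean * s) / total_mean ** 3)
    return weights


w = dwell_fraction_weights([0.2, 0.5, 0.1])
```

A useful sanity check is that the correction terms cancel when summed over nodes, so the approximate weights still add up to one, just as the exact fractions do.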
The estimation formula (10) becomes

$$ \chi_i = \frac{1}{T} \sum_{t=1}^{T} v_{it} - \sum_{j=1}^{N} E[g(\tau_j, \gamma_j)] \, c_{ij} , \tag{19} $$

where j indexes the speech nodes of the hypothesized model. In Equation (19) the first sum is over
the observed cepstral coefficient sequence, and the second sum is over the parameters of the
hypothesized model. Therefore, we refer to this technique as a hypothesis-driven technique.
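A minimal Python sketch of the Step 1 estimate follows. It assumes the per-node weights (the approximate expected dwell fractions) have already been computed and are passed in; the function name `estimate_stress` and the argument layout are ours, not the report's.

```python
def estimate_stress(frames, node_means, weights):
    """Hypothesis-driven estimate of the stress vector, as in Equation (19).

    frames      observed cepstral vectors of the test token (T of them)
    node_means  cepstral mean vector of each speech node of the
                hypothesized word model
    weights     expected fraction of the word spent in each node
    """
    t = len(frames)
    n_ceps = len(frames[0])
    estimate = []
    for i in range(n_ceps):
        observed = sum(v[i] for v in frames) / t                  # time average
        expected = sum(w * c[i] for w, c in zip(weights, node_means))
        estimate.append(observed - expected)
    return estimate


chi = estimate_stress(
    frames=[[2.5, 0.8], [2.5, 0.8]],
    node_means=[[1.0, 0.0], [3.0, 2.0]],
    weights=[0.5, 0.5],
)
```

The estimate has to be recomputed for every reference word, because the expected average depends on the hypothesized model while the observed average does not.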


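Step 2 (Smoothing $\chi_i$)

After $\chi_1, \chi_2, \ldots, \chi_{12}$ are estimated, we fit Equation (6) to them. A least-mean-square fit requires
solving the following equations:

$$ a = \frac{\sum_i \chi_i \, e^{-b(i-1)}}{\sum_i e^{-2b(i-1)}} \tag{20a} $$

and

$$ \frac{\sum_i \chi_i \, e^{-b(i-1)}}{\sum_i e^{-2b(i-1)}} = \frac{\sum_i i \, \chi_i \, e^{-b(i-1)}}{\sum_i i \, e^{-2b(i-1)}} , \tag{20b} $$

where Equation (20b) can be solved numerically. A less computationally intensive and yet more
robust fit (i.e., one which is less susceptible to the effect of outlying data) is given by fitting
exponential functions to all pairs $\{\chi_i, \chi_j\}$, $i \ne j$, or a subset of these pairs, and then by averaging
magnitudes and time constants of the fits. We have chosen to fit the pairs that contain $\chi_1$ and
one of $\chi_2$, $\chi_3$, $\chi_4$, and $\chi_5$, namely, $\{\chi_1, \chi_j\}$, j = 2, 3, 4, 5. Fitting Equation (6) to such a pair
gives $\chi_1 = a$ and $\chi_j = a \, e^{-b(j-1)}$. Therefore,

$$ b_j = \begin{cases} \dfrac{1}{j-1} \ln \dfrac{\chi_1}{\chi_j} , & \chi_1 \chi_j > 0 \text{ and } |\chi_1| > |\chi_j| \\ 0 , & \text{otherwise} \end{cases} \qquad a_j = \begin{cases} \chi_1 , & b_j \ne 0 \\ 0 , & \text{otherwise} \end{cases} \tag{21} $$

and a and b are the averages of the nonzero $a_j$'s and $b_j$'s.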
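The pairwise smoothing can be sketched in Python as follows. Two points are assumptions of the sketch rather than statements from the report: the expression used for $b_j$ is obtained by solving $\chi_1 = a$ and $\chi_j = a e^{-b(j-1)}$ for b, and when no pair satisfies the sign and magnitude conditions the function returns a = b = 0, i.e. no compensation. The names `fit_exponential` and `smoothed_stress` are ours.

```python
import math


def fit_exponential(chi, partners=(2, 3, 4, 5)):
    """Fit chi_i = a * exp(-b * (i - 1)) from the pairs {chi_1, chi_j}.

    chi[0] holds chi_1. A pair is used only if both values have the same
    sign and |chi_1| > |chi_j|, which makes the fitted b positive.
    """
    chi_1 = chi[0]
    a_vals, b_vals = [], []
    for j in partners:
        chi_j = chi[j - 1]
        if chi_1 * chi_j > 0 and abs(chi_1) > abs(chi_j):
            b_vals.append(math.log(chi_1 / chi_j) / (j - 1))
            a_vals.append(chi_1)
    if not b_vals:
        return 0.0, 0.0              # assumption: fall back to no compensation
    return sum(a_vals) / len(a_vals), sum(b_vals) / len(b_vals)


def smoothed_stress(a, b, n_ceps=12):
    """Evaluate the fitted exponential for coefficients 1..n_ceps."""
    return [a * math.exp(-b * (i - 1)) for i in range(1, n_ceps + 1)]


raw = [-2.0 * math.exp(-0.5 * (i - 1)) for i in range(1, 13)]
a, b = fit_exponential(raw)
```

Only the first five estimates enter the fit, so noisy estimates of the higher order coefficients cannot distort the smoothed curve.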
7. SUMMARY OF THE PROCEDURE FOR CEPSTRAL COMPENSATION*
AND THE EXPERIMENTAL RESULTS

Given the cepstral vectors of a test token and the Hidden Markov word model for a reference
(the procedure is done for every reference word), the procedure for the adaptive cepstral
compensation and recognition is described as follows:

Step  1:  Compute  a  set  of  stress  components  [c.f.  Equation  (19)]. 

Step  2:  Smooth  the  stress  components  by  fitting  an  exponential  function  to  them 
[c.f.  Equations  (6)  and  (21)]. 

Step  3:  Subtract  the  values  of  the  exponential  function  from  the  cepstral  vectors 
of  the  test  token. 

Step  4:  In  recognition,  perform  likelihood  tests  using  the  compensated  test  tokens. 

In Table V we summarize the recognition error rates when the hypothesis-driven stress compensation
is applied to the “simulated stress” data base. For comparison the error rates of the
baseline and of multimodel compensation are also included.


TABLE V

Substitution Rate (Percent):
A Comparison of Multimodel Fixed Stress Compensation
with Hypothesis-Driven Stress Compensation

Condition            Norm   Fast   Loud   Noise   Soft   Shout   Avg5   Avg6
Baseline HMM          1.0    6.1   29.1    19.6   13.5    86.4   13.9   25.9
Multimodel            1.0    4.2   12.1     6.7    5.5    68.7    5.9   16.4
Hypothesis-Driven     0.9    4.7   12.7     7.0    5.7    72.4    6.2   17.2

This technique has also been applied to a more advanced 14-node, fixed-variance HMM system16
whose parameters contain cepstral coefficients as well as differential cepstral coefficients.
Because cepstral variances are fixed in this recognizer, no variance scaling is performed. The
recognition results, with and without cepstral compensations, are listed in Table VI.


* The compensation technique is not restricted to an HMM baseline system; similar estimation
formulas can be derived for DTW (dynamic time warping) based recognition systems.
TABLE VI

Substitution Rate (Percent):
An Advanced HMM Recognizer

Condition               Norm   Fast   Loud   Noise   Soft   Shout   Avg5   Avg6
Without Compensation     0.4    1.7    3.4     2.9    4.4    49.8    2.5   10.4
With Compensation        0.4    1.7    3.4     1.4    2.4    45.3    1.9    9.0


8.  CONFIDENCE  INTERVAL  ANALYSIS  OF  EXPERIMENTAL  RESULTS 


We  wish  to  demonstrate  that  the  reductions  achieved  in  substitution  rate  are  statistically 
significant. 

Suppose that the probability of a substitution error is p; then the probability of committing
k errors in a test data base of n tokens is given by the binomial function

$$ \Pr(k \text{ errors in } n \text{ tokens}) = \binom{n}{k} p^k (1 - p)^{n-k} . \tag{22} $$

The mean and variance of the number of errors, k, are given by

$$ \mu = np , \qquad \sigma^2 = np(1 - p) . \tag{23} $$

For large n we approximate the binomial function by the Gaussian probability density
N[np, np(1 - p)]. The confidence interval of 95% is given by $(\mu - 1.96\sigma, \mu + 1.96\sigma)$. We define
the parameter $\lambda$ as the ratio:

$$ \lambda = \frac{1.96\,\sigma}{\mu} , \tag{24} $$

so that the 95% confidence interval becomes $(\mu - \lambda\mu, \mu + \lambda\mu)$. Substituting (23) into (24), assuming
$p \ll 1$,

$$ \lambda \approx \frac{1.96}{\sqrt{np}} . \tag{25} $$

For n = 8400, p = 13.9% (5-avg, baseline HMM) and p = 2.5% (5-avg, advanced HMM), we have
$\lambda$ = 5.7% and $\lambda$ = 13.5%, respectively. The 95% confidence intervals, corresponding to p = 13.9%
and p = 2.5%, are roughly (13.1%, 14.7%) and (2.16%, 2.84%), respectively. In Table V, the
substitution error rate p = 6.2% (5-avg, hypothesis-driven compensation) lies well outside the interval
(13.1%, 14.7%); similarly in Table VI the error rate p = 1.9% (5-avg, compensation) lies well
outside the interval (2.16%, 2.84%). Hence the improvements obtained using this type of
compensation are statistically significant.
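The interval arithmetic above is easy to reproduce. The Python sketch below evaluates Equation (25) and the resulting 95% interval for an error rate; the function names are ours. A second function evaluates Equation (24) without the small-p simplification, for comparison.

```python
import math


def relative_half_width(n, p, z=1.96):
    """lambda = z * sigma / mu for a binomial error count, Equation (24)."""
    return z * math.sqrt(n * p * (1.0 - p)) / (n * p)


def approx_interval(n, p, z=1.96):
    """Interval (p - lambda * p, p + lambda * p) using Equation (25)."""
    lam = z / math.sqrt(n * p)        # valid when p << 1
    return p * (1.0 - lam), p * (1.0 + lam)


baseline = approx_interval(8400, 0.139)   # 5-avg, baseline HMM
advanced = approx_interval(8400, 0.025)   # 5-avg, advanced HMM
```

Keeping the factor (1 - p) makes the interval slightly narrower than the approximation, so using Equation (25) is the more conservative choice for this comparison.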
9. CONCLUSION


Spectral tilt has been found to vary significantly for speech spoken in stressful talking
environments. We studied the statistical variations of cepstral coefficients embedded in the framework
of Markov models and found that the observed changes in cepstral mean values, from normal
speech trained models to simulated-stress trained models, corresponded approximately to an
exponential type of spectral tilt. A simple and efficient compensation technique, the
hypothesis-driven cepstral compensation, has been formulated. Using this simple compensation technique,
recognition experiments yielded significant reduction in error rate.

It is likely that further improvement may be achieved via reliable silence/voiced/unvoiced
separation before the application of cepstral coefficient compensation, with compensation for
spectral tilt only in voiced segments. A Bayes estimate that incorporates and updates a priori
knowledge of the distributions of the stress component $\chi_i$ and of the parameters a and b in the
smoothing process may also be superior to our estimate. Other improvements may be achieved
through more detailed understanding and modeling of speech variations in stress.
APPENDIX

LIST OF WORDS IN THE “SIMULATED STRESS” VOCABULARY

zero          airspeed      east        mode      sensor
ten           air           echo        narrow    south
one           alpha         elevation   nav       standby
twenty        altitude      negative    north     start
two           auto          erase       north     status
thirty        azimuth       fix         no        steerpoint
three         back          freeze      off       step
forty         bar           fuel        oh        stop
fifty         bravo         go          out       synthesis
five          break         ground      point     target
sixty         change        hello       profile   thousand
six           Charlie       help        quiet     threat
seventy       combat        history     radar     tracker
seven         comm          hot         range     train
eighty        confirm       hundred     recall    voice
eight         control       inventory   release   weapon
ninety        cursor        lock        repeat    west
nine          degrees       map         return    white
advise        delta         mark        rubout    wide
affirmative   destination   medium      select    yes